Abstract

In this work, we consider a variant of the classical Longest Common
Subsequence problem called Doubly-Constrained Longest Common
Subsequence (DC-LCS). Given two strings $s_1$ and $s_2$ over an
alphabet $\Sigma$, a set $C_s$ of strings, and a function
$C_o : \Sigma \to \mathbb{N}$, the DC-LCS problem consists in finding the
longest subsequence $s$ of $s_1$ and $s_2$ such that $s$ is a
supersequence of all the strings in $C_s$ and such that the number of
occurrences in $s$ of each symbol $\sigma \in \Sigma$ is upper bounded by
$C_o(\sigma)$. The DC-LCS problem provides a clear mathematical
formulation of a sequence comparison problem in Computational Biology
and generalizes two other constrained variants of the LCS problem: the
Constrained LCS and the Repetition-Free LCS. We present two results for
the DC-LCS problem. First, we illustrate a fixed-parameter algorithm
where the parameter is the length of the solution. Secondly, we prove a
parameterized hardness result for the Constrained LCS problem when the
parameter is the number of the constraint strings ($|C_s|$) and the size
of the alphabet $\Sigma$. This hardness result also implies the
parameterized hardness of the DC-LCS problem (with the same parameters)
and its NP-hardness when the size of the alphabet is constant.

1 Introduction

The problem of computing the longest common subsequence (LCS) of two 
sequences is a fundamental problem in stringology and in the whole field of 
algorithms, as it couples a wide range of applications with a simple
mathematical formulation. Applications of variants of LCS range from
Computational Biology to data compression, syntactic pattern recognition
and file comparison (for instance it is used in the Unix diff command).

A few basic definitions are in order. Given two sequences $s$ and $t$ over
a finite alphabet $\Sigma$, $s$ is a subsequence of $t$ if $s$ can be obtained
from $t$ by removing some (possibly zero) characters. When $s$ is a
subsequence of $t$, then $t$ is a supersequence of $s$. Given two sequences
$s_1$ and $s_2$, the longest common subsequence problem asks for a longest
possible sequence $t$ that is a subsequence of both $s_1$ and $s_2$.
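
The definitions above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `is_subsequence` and `lcs` are illustrative assumptions, and `lcs` is the textbook quadratic dynamic program for two sequences.

```python
def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    # Each membership test consumes the iterator up to the matched character.
    return all(ch in it for ch in s)


def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    # length[i][j] = LCS length of the prefixes s1[:i] and s2[:j]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Backtrack from the bottom-right corner to recover one optimal sequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For instance, `lcs("aaaabbbccd", "ddcbbbbaaaa")` returns `"aaaa"`, since the two strings list their symbols in opposite orders and a common subsequence can therefore use only one symbol.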

The problem of computing the longest common subsequence of two se- 
quences has been deeply investigated and polynomial time algorithms are 
well-known for the problem [11]. It is possible to generalize the LCS prob- 
lem to a set of sequences: in such case the result is a sequence that is a 
subsequence of all input sequences. This generalized problem is NP-hard
even on a binary alphabet [10] and it is not approximable within factor
$O(n^{1-\epsilon})$, for any constant $\epsilon > 0$, on an arbitrary
alphabet [9].

Computational Biology is a field where several variants of the LCS prob- 
lem have been introduced for various purposes. For instance researchers 
defined some similarity measures between genome sequences based on
constrained forms of the LCS problem. More precisely, an LCS-like problem
has been studied that deals with two types of symbols (mandatory and
optional symbols) to model the differences in the number of occurrences
allowed for each gene [U [2]. An illustrative example is the definition of 
repetition-free longest common subsequence [2] where, given two sequences 
$s_1$ and $s_2$, a repetition-free common subsequence is a subsequence of
both $s_1$ and $s_2$ that contains at most one occurrence of each symbol.
Such a model can be useful in genome rearrangement analysis, in particular
when dealing with the exemplar model. In such a framework we want to
compute an exemplar sequence, that is a sequence that contains only one
representative (called the exemplar) for each family of duplicated genes
inside a genome. In biological terms, the exemplar gene may correspond to
the original copy of the gene, from which all other copies have originated.

A different variant of LCS that has been introduced to compare biological
sequences is called Constrained Longest Common Subsequence [13].
More precisely, such a variant of LCS can be useful when comparing two
biological sequences that have a known substructure in common [15]. Given
two sequences $s_1$, $s_2$, and a constraint sequence $s_c$, we look for a
longest common subsequence $s$ of $s_1$, $s_2$, such that $s_c$ is a
subsequence of $s$. The
constrained LCS problem admits polynomial-time algorithms [T3] [3J but 
it becomes NP-hard when generalized to a set of input sequences or to a set 
of constraint sequences [8]. 

In this paper we introduce a new problem, called Doubly-Constrained 
Longest Common Subsequence and denoted as DC-LCS, that extends both 
the repetition-free longest common subsequence problem and the constrained
longest common subsequence problem. More precisely, given two input
sequences $s_1$, $s_2$, the DC-LCS problem asks for the longest common
subsequence $s$ that satisfies two constraints: (i) the number of
occurrences of each symbol $\sigma$ is upper bounded by a quantity
$C_o(\sigma)$, and (ii) $s$ is a supersequence of the strings of a specified
constraint set. First, we design a fixed-parameter algorithm [7] when the
parameter is the length of the solution. Then we give a parameterized
hardness result for the Constrained Longest Common Subsequence, when the
number of constraint sequences and the size of the alphabet are considered
as parameters. This result implies the same parameterized hardness result
for DC-LCS.

2 Basic Definitions 

Let $s_1$, $s_2$ be two strings over an alphabet $\Sigma$. Given a string
$s$, we denote by $s[i]$ the symbol at position $i$ in string $s$, and by
$s[i \ldots j]$ the substring of $s$ starting at position $i$ and ending at
position $j$. A string constraint $C_s$ consists of a set of strings, while
an occurrence constraint $C_o$ is a function $C_o : \Sigma \to \mathbb{N}$,
assigning an upper bound on the number of occurrences of each symbol in
$\Sigma$. First, consider the following variant of the LCS problem.

Problem 1. Constrained Longest Common Subsequence (C-LCS)
Input: two strings $s_1$ and $s_2$, a string constraint $C_s$.

Output: a longest common subsequence $s$ of $s_1$ and $s_2$, so that each
string in $C_s$ is a subsequence of $s$.

The problem admits a polynomial time algorithm when $C_s$ consists of a
single string [15] [5], while it is NP-hard when $C_s$ consists of an
arbitrary number of strings [8]. In the latter case, notice that C-LCS
cannot be approximated, since a feasible solution for the C-LCS problem
must be a supersequence of all the strings in the constraint $C_s$ and
deciding whether such a feasible solution exists is NP-complete [8].

Problem 2. Repetition-free Longest Common Subsequence (RF-LCS)
Input: two strings $s_1$ and $s_2$.

Output: a longest common subsequence $s$ of $s_1$ and $s_2$, so that $s$
contains at most one occurrence of each symbol $\sigma \in \Sigma$.

The problem is APX-hard even when each symbol occurs at most twice
in each of the input strings $s_1$ and $s_2$ [2]. A positive note is that allowing
at most $k$ occurrences of each symbol in each of $s_1$ and $s_2$ results in a
$\frac{1}{k}$-approximation algorithm [2].

We can introduce an even more general version of both the C-LCS and
RF-LCS problems, called the Doubly-Constrained Longest Common Subsequence
(DC-LCS) problem.

Problem 3. Doubly-Constrained Longest Common Subsequence (DC-LCS)
Input: two strings $s_1$ and $s_2$, a string constraint $C_s$, and an
occurrence constraint $C_o$.

Output: a longest common subsequence $s$ of $s_1$ and $s_2$, so that each
string in $C_s$ is a subsequence of $s$ and $s$ contains at most
$C_o(\sigma)$ occurrences of each symbol $\sigma \in \Sigma$.

It is easy to see that the C-LCS problem is the restriction of the DC-LCS
problem when $C_o(\sigma) = |s_1| + |s_2|$ for each $\sigma \in \Sigma$. At
the same time, the RF-LCS problem is the restriction of the DC-LCS problem
when $C_s = \emptyset$ and $C_o(\sigma) = 1$ for each $\sigma \in \Sigma$.
Therefore the DC-LCS problem is APX-hard, since it inherits all hardness
properties of C-LCS and RF-LCS.
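
To make the three problem statements concrete, the following Python sketch checks whether a candidate string is a feasible DC-LCS solution. It is only an illustration under stated assumptions: the name `is_feasible` and the representation of the occurrence constraint as a dictionary covering every symbol are our choices, not part of the paper.

```python
from collections import Counter


def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


def is_feasible(s, s1, s2, constraint_strings, occ_bound):
    """Check the DC-LCS feasibility conditions for a candidate s.

    constraint_strings plays the role of the string constraint and
    occ_bound (a dict symbol -> int, assumed to cover every symbol of s)
    plays the role of the occurrence constraint.
    """
    # s must be a common subsequence of the two input strings.
    if not (is_subsequence(s, s1) and is_subsequence(s, s2)):
        return False
    # s must be a supersequence of every constraint string.
    if not all(is_subsequence(c, s) for c in constraint_strings):
        return False
    # Each symbol may occur at most occ_bound[symbol] times in s.
    counts = Counter(s)
    return all(counts[a] <= occ_bound[a] for a in counts)
```

Under this representation, C-LCS corresponds to setting every bound to `len(s1) + len(s2)`, and RF-LCS to an empty `constraint_strings` with every bound equal to 1.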

3 A Fixed-Parameter Algorithm for DC-LCS

Initially we present a fixed-parameter algorithm for the DC-LCS problem
when $|C_s| \le 1$ (hence the result holds also for the RF-LCS problem),
where the parameter is the size of a solution of DC-LCS. Later on, we will
extend the algorithm to a generic set $C_s$.

The algorithm is based on the color coding technique [1]. We recall the
basic definition of a perfect family of hash functions [14]. Given a set
$S$, a family $F$ of hash functions from $S$ to $\{1, 2, \ldots, k\}$ is
called perfect if for any $S' \subseteq S$ of size $k$, there exists an
injective hash function $f \in F$ from $S'$ to the set of labels
$\{1, 2, \ldots, k\}$.

Since $|C_s| \le 1$, we denote by $s_c$ the only sequence in $C_s$. Let $k$
be the size of a solution for DC-LCS, and recall that a solution contains
at most $C_o(\sigma)$ occurrences of each symbol $\sigma \in \Sigma$. Notice
that, since $s$ is a subsequence of both $s_1$ and $s_2$, and by the
definition of $C_o$, the number of occurrences of each symbol
$\sigma \in \Sigma$ in a solution $s$ is also (upper) bounded by the number
of occurrences of $\sigma$ in each of $s_1$ and $s_2$ (i.e.
$occ(\sigma, s) \le \min\{C_o(\sigma), occ(\sigma, s_1), occ(\sigma, s_2)\}$).
Let $C'_o$ be a function from $\Sigma$ to $\mathbb{N}$ defined as
$C'_o(\sigma) := \min\{C_o(\sigma), occ(\sigma, s_1), occ(\sigma, s_2)\}$.

Given $C'_o$ and the sequences $s_1$ and $s_2$, we construct a set $S$ that
contains the pairs $(\sigma, i)$ for each $\sigma \in \Sigma$ and
$i \in \{1, \ldots, C'_o(\sigma)\}$. For example, if $s_1 = aaaabbbccd$,
$s_2 = ddcbbbbaaaa$, and $C_o(a) = C_o(b) = C_o(c) = C_o(d) = 3$, then the
set $S$ is equal to
$\{(a,1), (a,2), (a,3), (b,1), (b,2), (b,3), (c,1), (d,1)\}$.
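
The construction of the reduced bound and of the set of pairs can be reproduced with a few lines of Python. This is a sketch, with the function name `build_pairs` and the dictionary encoding of the occurrence constraint chosen here for illustration only.

```python
from collections import Counter


def build_pairs(s1, s2, occ_bound):
    """Return the reduced occurrence bound and the list of (symbol, i) pairs.

    occ_bound is a dict symbol -> int over the whole alphabet.
    """
    c1, c2 = Counter(s1), Counter(s2)
    # The reduced bound is the minimum of the given bound and the number
    # of occurrences of the symbol in each of the two input strings.
    reduced = {a: min(b, c1[a], c2[a]) for a, b in occ_bound.items()}
    # One pair (symbol, i) for every i between 1 and the reduced bound.
    pairs = [(a, i) for a in sorted(reduced)
             for i in range(1, reduced[a] + 1)]
    return reduced, pairs
```

On the example in the text, `build_pairs("aaaabbbccd", "ddcbbbbaaaa", dict.fromkeys("abcd", 3))` yields the reduced bounds 3, 3, 1, 1 for a, b, c, d and the eight pairs listed above.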

Consider now a perfect family $F$ of hash functions from $S$ to the set
$\{1, 2, \ldots, k\}$. We can associate a function
$l : \Sigma \to 2^{\{1,2,\ldots,k\}}$ with each $f \in F$, where
$l(\sigma) = \{f(\sigma, i) : (\sigma, i) \in S\}$. Let $s$ be a solution of
the DC-LCS problem of length at most $k$, and let $L$ be a subset of
$\{1, \ldots, k\}$. Then $s$ is an $L$-colorful solution w.r.t. a hash
function $f \in F$ (and its associated function $l$) if and only if there
exists a function $l_1 : \Sigma \to 2^{\{1,2,\ldots,k\}}$ which satisfies
the following conditions:

(i) $\forall \sigma \in \Sigma$, $l_1(\sigma) \subseteq l(\sigma) \cap L$,

(ii) $\forall \sigma \in \Sigma$, $|l_1(\sigma)|$ is equal to the number of occurrences of $\sigma$ in $s$,

(iii) $\forall \sigma_1, \sigma_2 \in \Sigma$ with $\sigma_1 \ne \sigma_2$, $l_1(\sigma_1) \cap l_1(\sigma_2) = \emptyset$.

Intuitively, an $L$-colorful solution $s$ is a sequence such that it is
possible to associate distinct elements (labels) of the set $L$ with all
the characters of $s$ by using the function $l$. Notice that the length of
an $L$-colorful solution $s$ is equal to the number of labels that $s$
uses, and each symbol $\sigma$ does not occur more than $C'_o(\sigma)$
times in $s$.
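
The labeling conditions (i)-(iii) can be tested by brute force on small inputs. The sketch below is an assumption-laden illustration: `label_sets` and `is_colorful` are hypothetical names, a hash function is encoded as a dictionary from pairs to labels, and only the labeling conditions are checked (feasibility of the string as a DC-LCS solution is not).

```python
from collections import Counter
from itertools import combinations


def label_sets(f):
    """Map each symbol to the set of labels f assigns to its pairs."""
    l = {}
    for (sym, _idx), lab in f.items():
        l.setdefault(sym, set()).add(lab)
    return l


def is_colorful(s, f, L):
    """Return True if conditions (i)-(iii) can be met for s, f and L."""
    l = label_sets(f)
    L = set(L)
    need = sorted(Counter(s).items())  # (symbol, number of occurrences)

    def assign(idx, used):
        if idx == len(need):
            return True
        sym, count = need[idx]
        # Labels still available to this symbol: inside l(sym) and L,
        # and not already given to a previous symbol.
        avail = (l.get(sym, set()) & L) - used
        return any(assign(idx + 1, used | set(choice))
                   for choice in combinations(sorted(avail), count))

    return assign(0, frozenset())
```

The search tries every way of giving each symbol as many unused labels as it has occurrences, so its running time is exponential; it is meant only to make the definition tangible.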

The basic idea of our algorithm is to verify if there exists an L-colorful 
solution that uses all labels in L or, equivalently, if the length of an optimal 
L-colorful solution is |L|. Such task is fulfilled via a dynamic programming 
recurrence. Since $F$ is a perfect family of hash functions, for each feasible
solution $s$ of length $k$, there exists a hash function $f \in F$ such that $s$ is
$\{1, \ldots, k\}$-colorful w.r.t. $f$. Therefore, by computing the recurrence for all
hash functions of F, we are guaranteed to find a solution of length k, if such 
a solution exists. 

Given a hash function $f$, we define $V[i, j, h, L]$ which takes value 1 if
and only if there exists an $L$-colorful common subsequence $s$ of
$s_1[1 \ldots i]$ and $s_2[1 \ldots j]$, such that $s$ is a supersequence of
$s_c[1 \ldots h]$ and $s$ has length equal to $|L|$ (or, equivalently, $s$
uses all labels in $L$). Notice that the actual supersequence can be
computed by a standard backtracking technique. Theorem 3.1 states that
$V[i, j, h, L]$ can be computed by the following dynamic programming
recurrence which is an extension of the standard equation for the Longest
Common Subsequence (LCS) problem [6].



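$$
V[i,j,h,L] = \max
\begin{cases}
V[i-1,j,h,L] & \\
V[i,j-1,h,L] & \\
V[i-1,j-1,h,L \setminus \{\lambda\}] & \text{if } s_1[i] = s_2[j] \text{ and } \lambda \in L \cap l(s_1[i]) \\
V[i-1,j-1,h-1,L \setminus \{\lambda\}] & \text{if } s_1[i] = s_2[j] = s_c[h] \text{ and } \lambda \in L \cap l(s_1[i])
\end{cases}
\tag{1}
$$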

The boundary conditions are $V[0,j,h,L] = 0$ and $V[i,0,h,L] = 0$ if
$L \ne \emptyset$, while $V[i,j,0,\emptyset] = 1$, and
$V[i,j,h,\emptyset] = 0$ when $h > 0$. Moreover, notice that, as a
consequence of the recurrence's definition, we have $V[i,j,h,L] = 0$ for
all $h > |L|$. A feasible solution of length $k$ is
$\{1, \ldots, k\}$-colorful w.r.t. $f$ if and only if
$V[|s_1|, |s_2|, |s_c|, \{1, \ldots, k\}] = 1$. In this case, a standard
backtracking search can reconstruct the actual solution.
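
The recurrence and its boundary conditions can be sketched in Python as a memoized recursion. Several choices here are assumptions made for illustration and are not the paper's implementation: label sets are encoded as bitmasks, the names `colorful_solution_exists` and `exists_solution_of_length` are hypothetical, and the family of all functions from the set of pairs to the labels is used as a trivially perfect (but exponentially large) stand-in for the perfect family of hash functions. The sketch is meant for short strings and $k \ge 1$ only.

```python
from collections import Counter
from functools import lru_cache
from itertools import product


def colorful_solution_exists(s1, s2, sc, f, k):
    """Evaluate V[|s1|, |s2|, |sc|, {1..k}] for one hash function f.

    f maps pairs (symbol, index) to labels in 1..k.
    """
    label_mask = {}  # l(symbol) as a bitmask over the labels 1..k
    for (sym, _idx), lab in f.items():
        label_mask[sym] = label_mask.get(sym, 0) | (1 << (lab - 1))

    @lru_cache(maxsize=None)
    def V(i, j, h, L):
        if L == 0:            # V[i,j,0,{}] = 1 and V[i,j,h,{}] = 0 for h > 0
            return h == 0
        if i == 0 or j == 0:  # no characters left but labels still unused
            return False
        if V(i - 1, j, h, L) or V(i, j - 1, h, L):  # cases 1 and 2
            return True
        if s1[i - 1] != s2[j - 1]:
            return False
        usable = L & label_mask.get(s1[i - 1], 0)
        for bit in range(k):
            if not (usable >> bit) & 1:
                continue
            rest = L & ~(1 << bit)  # remove the chosen label from L
            if V(i - 1, j - 1, h, rest):  # case 3
                return True
            if (h > 0 and sc[h - 1] == s1[i - 1]
                    and V(i - 1, j - 1, h - 1, rest)):  # case 4
                return True
        return False

    return V(len(s1), len(s2), len(sc), (1 << k) - 1)


def exists_solution_of_length(s1, s2, sc, occ_bound, k):
    """Try every function from the pairs to {1..k} (brute-force family)."""
    c1, c2 = Counter(s1), Counter(s2)
    pairs = [(a, i) for a, b in sorted(occ_bound.items())
             for i in range(1, min(b, c1[a], c2[a]) + 1)]
    for labels in product(range(1, k + 1), repeat=len(pairs)):
        if colorful_solution_exists(s1, s2, sc, dict(zip(pairs, labels)), k):
            return True
    return False
```

For example, with `s1 = "abab"`, `s2 = "baba"`, constraint `"b"` and bounds `{"a": 1, "b": 2}`, a solution of length 3 exists (`"bab"`), while no solution of length 4 does.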

Theorem 3.1. Let $f \in F$ be a hash function mapping injectively the solution
$s$ to the set of labels $\{1, \ldots, k\}$. Then Equation (1) is correct.
Proof. We will prove the theorem by induction, that is we will prove the
correctness of the value in $V[i_a, j_a, h_a, L_a]$ by assuming that of
$V[i_b, j_b, h_b, L_b]$ when $i_b \le i_a$, $j_b \le j_a$, $h_b \le h_a$,
$L_b \subseteq L_a$, and at least one inequality is strict.

Let $s$ be an optimal $L_a$-colorful solution for the sequences
$s_1[1, \ldots, i_a]$, $s_2[1, \ldots, j_a]$, $s_c[1, \ldots, h_a]$, and let
$\beta$ be the last symbol of $s$, that is $s = t\beta$, where $t$ is the
prefix of $s$ consisting of all but the last character.

If $\alpha \ne \beta$ then, just as for the recurrences of the standard LCS
problem [6], the theorem holds.

Therefore we can assume now that $\alpha = \beta$. Since $s$ is
$L_a$-colorful, then there exists a mapping $l_1$ satisfying the definition
of $L$-colorfulness. By condition (ii), $|l_1(\beta)|$ is equal to the
number of occurrences of $\beta$ in $s$. Let $z$ be the label which is the
image through $f$ of the last character of $s$. Then there exists an
$L_a \setminus \{z\}$-colorful solution $t$ of $s_1[1, \ldots, i_a - 1]$,
$s_2[1, \ldots, j_a - 1]$, $s_c[1, \ldots, h_a]$ (if $t$ is a supersequence
of $s_c[1, \ldots, h_a]$) or of $s_1[1, \ldots, i_a - 1]$,
$s_2[1, \ldots, j_a - 1]$, $s_c[1, \ldots, h_a - 1]$, hence completing the
proof. □

If $f$ is a hash function that does not map injectively the solution $s$ of
length $k$ to the set of labels $\{1, \ldots, k\}$ then, by definition of
hash function, there is a label $z \in \{1, \ldots, k\}$ that is not in the
image through $f$ of any character of $s$. The latter observation also
implies that $z$ is not in the image through $l$ of any symbol, therefore
for each set $L$ including $z$, the last two cases of our recurrence
equation cannot apply, which implies that
$V[i, j, h, \{1, \ldots, k\}] = 0$ for all values of $i$, $j$, $h$, hence
establishing the correctness of our algorithm.

It is immediate to notice that the total number of entries of the matrix
$V[\cdot, \cdot, \cdot, \cdot]$ is $|s_1||s_2||s_c|2^k$. Furthermore notice
that computing each entry requires at most $O(k)$ time, as case 1 and
case 2 of the recurrence require constant time, while case 3 and case 4
require at most $O(k)$ time, since $|L| \le k$. Since there exists a perfect
family of hash functions whose size is $O(\log |S|)2^{O(k)}$ and that can be
computed in $O(|S| \log |S|)2^{O(k)}$ time [1], and $|S| \le |s_1|$, the
algorithm has an overall
$O(|s_1| \log |s_1|)2^{O(k)} + O(|s_1||s_2||s_c|k2^k)$ time complexity.

The algorithm actually computes a longest supersequence of $s_c$ that is a
feasible solution of the problem. Assume now that $C_s$ is a generic set of
constraint strings, and let $x$ be an optimal solution of a generic
instance of the DC-LCS problem of size $k$. It is immediate to notice that,
by removing from $x$ all symbols that are not also in one of the sequences
of $C_s$, we obtain a common supersequence $x_1$ of $C_s$ that is a
subsequence of $x$. Moreover, as $x$ has size $k$, $x_1$ contains at most
$k$ characters (where $k$ is the length of an optimal solution).

Notice that the alphabet consisting of the symbols appearing in at least
one sequence of $C_s$ contains at most $k$ symbols, for otherwise all
supersequences of $C_s$ would be longer than $k$. Consequently there are at
most $k^k$ such supersequences. Our algorithm for a generic $C_s$
enumerates all such supersequences $s_c$, and applies the algorithm for
$|C_s| = 1$ on the new set of constraint sequences made only of $s_c$,
returning the longest feasible solution computed.
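
The enumeration step can be sketched as a brute-force generator. This is only an illustration under our own assumptions: the name `common_supersequences` is hypothetical, and the candidates are simply all strings of length at most $k$ over the symbols that occur in the constraint strings.

```python
from itertools import product


def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


def common_supersequences(constraints, k):
    """Yield every common supersequence of the constraints of length <= k.

    Only symbols occurring in some constraint string are used.
    """
    alphabet = sorted(set("".join(constraints)))
    for length in range(k + 1):
        for cand in product(alphabet, repeat=length):
            x = "".join(cand)
            if all(is_subsequence(c, x) for c in constraints):
                yield x
```

Each string produced this way would then be passed, as the single constraint, to the algorithm for one constraint string; for instance `list(common_supersequences(["ab", "ba"], 3))` gives `["aba", "bab"]`.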

The overall time complexity is clearly $O(k^k((|s_1| \log |s_1|)2^{O(k)} + |s_1||s_2||s_c|k2^k))$.

4 W[1]-hardness of C-LCS

In this section we prove that deciding whether there exists a feasible
solution of C-LCS is not only NP-complete, but it is also W[1]-hard when
the parameters are the number of strings in $C_s$ and the size of the
alphabet $\Sigma$ (see [7] for an exposition on the consequences of
W[1]-hardness).

We reduce from the Shortest Common Supersequence (SCS) problem
parameterized by the number of input strings and the size of the alphabet
$\Sigma$, which is known to be W[1]-hard [12]. Let
$R = \{r_1, \ldots, r_k\}$ be a set of sequences over alphabet $\Sigma$,
hence $R$ is a generic instance of the SCS problem. In what follows we
denote by $l$ the size of a solution of the SCS problem.

The input of the C-LCS consists of two sequences $s_1$, $s_2$, and a string
constraint $C_s$. Let # be a delimiter symbol not in $\Sigma$. Moreover,
given a sequence $r_i = y_1 y_2 \cdots y_z$ over alphabet $\Sigma$, let
$c(r_i)$ be the sequence $y_1 \# y_2 \# \cdots \# y_z \#$. Let
$C_s = \{\#^l\} \cup \{c(r_i) : r_i \in R\}$, let $w$ be a sequence over
$\Sigma$ such that $w$ contains exactly one occurrence of each symbol in
$\Sigma$, and let $rev(w)$ be the reversal of $w$. Finally, let
$s_1 = (w\#)^l$ and $s_2 = (rev(w)\#)^l$. In the following we call each
occurrence of $w$ or of $rev(w)$ a block.

Let $t$ be any supersequence of $\#^l$ that is also a common subsequence
of $s_1$ and $s_2$. Since in each of those sequences there are $l$ #s, then
also $t$ must contain $l$ #s, which in turn implies that, by construction
of $w$, at most one symbol of each block can be in $t$. Therefore $t$
contains at most $2l$ symbols. At the same time, let $p$ be a generic
sequence no longer than $2l$, ending with a # and such that no two symbols
from $\Sigma$ appear consecutively in $p$. Since each symbol of $\Sigma$
occurs exactly once in $w$, it is immediate to notice that $p$ is a common
subsequence of $s_1$ and $s_2$. Consequently, the set of all supersequences
of $\#^l$ that are also common subsequences of $s_1$ and $s_2$ is equal to
the set of sequences $q$ with length not larger than $2l$ and such that
(i) $q$ contains exactly $l$ #s, (ii) $q$ ends with a #, and (iii) taken
two consecutive symbols from $q$, at least one of those symbols is equal
to #.

An immediate consequence is that there exists a feasible solution of
length $2l$ of the instance of C-LCS made of the set $C_s$ and the two
sequences $s_1$ and $s_2$ iff there exists a supersequence of length $l$ of
the set $R$ of sequences.
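
The reduction can be exercised on a toy instance with the Python sketch below. It is an illustration only: the names `encode`, `build_instance` and `witness_is_feasible` are hypothetical, `w` is taken to be the alphabet in sorted order (any ordering with each symbol once would do), and only one direction of the equivalence is checked, namely that a supersequence of $R$ of length $l$ yields a feasible C-LCS solution of length $2l$.

```python
def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


def encode(r: str, delim: str = "#") -> str:
    """Follow every symbol of r by the delimiter, as in c(r)."""
    return "".join(ch + delim for ch in r)


def build_instance(R, alphabet, l, delim="#"):
    """Build s1, s2 and the constraint set from an SCS instance R."""
    assert delim not in alphabet
    w = "".join(sorted(alphabet))        # each symbol exactly once
    s1 = (w + delim) * l                 # l blocks w, each followed by #
    s2 = (w[::-1] + delim) * l           # l blocks rev(w), each followed by #
    constraints = [delim * l] + [encode(r, delim) for r in R]
    return s1, s2, constraints


def witness_is_feasible(x, R, alphabet, delim="#"):
    """Check that encode(x) is feasible for the instance with l = len(x)."""
    s1, s2, constraints = build_instance(R, alphabet, len(x), delim)
    q = encode(x, delim)
    return (is_subsequence(q, s1) and is_subsequence(q, s2)
            and all(is_subsequence(c, q) for c in constraints))
```

With `R = ["ab", "ba"]`, the common supersequence `"aba"` gives the feasible string `"a#b#a#"`, whereas `"aab"`, which is not a supersequence of `"ba"`, does not give a feasible string.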

The reduction described is an FPT-reduction [7]. Finally, notice that
the W[1]-hardness of C-LCS with parameters $|C_s|$ and $|\Sigma|$ implies
the W[1]-hardness of DC-LCS with parameters $|C_s|$ and $|\Sigma|$ since
C-LCS is a restriction of the DC-LCS problem.
Moreover, notice that the same reduction can be applied starting from
the SCS problem over binary alphabet, implying that the DC-LCS problem 
is NP-hard over a fixed ternary alphabet, as the SCS problem is NP-hard 
over a binary alphabet [13]. 